Workshop on Analyzing Spatial Transcriptomics Data

Phillip Nicol and Rafael Irizarry

Introduction

Outline

How to follow along

TODO: Will Provide GitHub Link to Download QMD and HTML. HOW TO INSTALL PACKAGES?

Review of Spatial transcriptomics (ST) technology

Many of the most popular technologies for ST can be divided into two groups:

  • Spot-based: Next generation sequencing (NGS) with spatial barcoding

  • Molecule-based: In-situ imaging of individual molecules

FOOTNOTE REF

Spot-based: 10X Genomics Visium

10X Genomics Visium Youtube Video

Molecule-based: Small molecule fluorescence in situ hybridization (smFISH)

  • Colored probes attach to mRNA transcript from a target gene.

  • Quantify expression by imaging

  • Key challenge is extending to many genes

Source: Figure 4A of Ni and Oudenaarden

Molecule-based: Multiplexed error robust FISH (MERFISH)

MERFISH uses (error robust) combinatorial labeling to increase the number of gene transcripts that can be measured.

Basic idea: assign a \(N\)-bit binary string to each gene. Then \(2^N\) genes can be encoded and measured after \(N\) sequential rounds of smFISH.

Source: Fig 1A of Chen et al. (2015). Science.

Comparison

Visium (Spot-based) Molecule-based
Spatial Resolution 1-10 cells per spot sub-cellular
Number of genes Whole transcriptome (20,000+) 100-10,000 (MERFISH)
Detection efficiency ~15% ~95%
Accessibility Commercially available; raw data FASTQ Raw images can be large and difficult to analyze

Most ST studies choose Visium

Moses and Pachter (2022). Nature Methods.

Most ST studies choose R

Moses and Pachter (2022). Nature Methods.

Spatial Transcriptomics Data Processing

Upstream data processing

Moses and Pachter (2022). Nature Methods.

Visium data format

Once processed, ST data can be represented using two matrices.

  • Count matrix \(Y\): For each of \(J\) spots record the number of reads from each of \(I\) genes.

  • Coordinate matrix \(X\): For each of \(J\) spots record the 2-dimensional spatial coordinate.

FIG

Storing ST data in R

The SpatialExperiment package provides a container to store the count and coordinate matrix. ADD REF

Creating a SpatialExperiment object

Installing and loading the SpatialExperiment package

Install the package:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("SpatialExperiment")

Load the package:

library(SpatialExperiment)

10X genomics Visium data

The 10X genomics website has example ST datasets.

For this workshop, we will download samples from human cerebral cortex:

Downloading from the website

The data can be downloaded by navigating to the website or directly from the terminal (Unix):

mkdir cortex
cd cortex 

curl -O https://cf.10xgenomics.com/samples/spatial-exp/1.1.0/V1_Human_Brain_Section_1/V1_Human_Brain_Section_1_raw_feature_bc_matrix.tar.gz

curl -O https://cf.10xgenomics.com/samples/spatial-exp/1.1.0/V1_Human_Brain_Section_1/V1_Human_Brain_Section_1_spatial.tar.gz

tar xf V1_Human_Brain_Section_1_raw_feature_bc_matrix.tar.gz

tar xf V1_Human_Brain_Section_1_spatial.tar.gz

Exercise: Creating the object

dir <- "./cortex"

### Step 1: Load the count matrix
require(DropletUtils)
fnm <- file.path(dir, "raw_feature_bc_matrix")
sce <- DropletUtils::read10xCounts(fnm)

### Step 2: Load the tissue image
img <- readImgData(
    path = file.path(dir, "spatial"),
    sample_id = "brain")

# Step 3: Read spatial coordinates
fnm <- file.path(dir, "spatial", "tissue_positions_list.csv")
xy <- read.csv(fnm, header = FALSE,
    col.names = c(
        "barcode", "in_tissue", "array_row", "array_col",
        "pxl_row_in_fullres", "pxl_col_in_fullres"))
ix <- match(sce$Barcode, xy$barcode) #Get cells in correct order
xy <- xy[ix,]

### Step 4: Construct feature metadata 
require(S4Vectors)
rd <- S4Vectors::DataFrame(
    symbol = rowData(sce)$Symbol)

### Step 5: Create the object
spe <- SpatialExperiment(
    assays = list(counts = assay(sce)),
    rowData = rd, 
    colData = DataFrame(xy), 
    spatialCoordsNames = c("pxl_col_in_fullres", "pxl_row_in_fullres"),
    imgData = img,
    sample_id = "brain")

### Step 6: Save the object 
saveRDS(spe, file=file.path(dir,"spe.RDS"))

Plotting an image of the tissue

spi <- getImg(spe)
plot(as.raster(spi))

Removing spots not in the tissue

require(ggplot2)

df <- data.frame(x=spatialCoords(spe)[,1], y=spatialCoords(spe)[,2],
                 in_tissue=as.character(spe$in_tissue))
p <- ggplot(df,aes(x=x,y=y,color=in_tissue))+
  geom_point(size=0.25)+
  scale_color_manual(values=c("grey", "red"))+
  theme_bw()
p

spe <- spe[,spe$in_tissue == 1] #This is how you subset by spots/samples

More example ST data

For additional examples of data in the SpatialExperiment format, a good package is STexampleData :

BiocManager::install("STexampleData")

### Example dataset 
spe2 <- STexampleData::ST_mouseOB()

Finding Spatially Variable Genes

What is a spatially variable gene (SVG)?

A spatially variable gene is one whose expression depends on spatial location

Source: REF

Exercise: Plotting spatially variable genes

In the human cerebral cortex, the genes MOBP, NPY, SNAP25, and PCP4 were previously reported by REF to be spatially variable. We will verify this by plotting.

Problem: Instead of gene symbols we are given ENSEMBL ID:

head(rownames(spe))
[1] "ENSG00000243485" "ENSG00000237613" "ENSG00000186092" "ENSG00000238009"
[5] "ENSG00000239945" "ENSG00000239906"

Exercise: Plotting SVGs

### Convert ENSEMBL ID to gene symbol
require(biomaRt)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
genes <- rownames(spe)
G_list <- getBM(filters="ensembl_gene_id",
                attributes= c("ensembl_gene_id","hgnc_symbol"),
                values=genes,
                mart=mart)
no.match <- which(G_list$hgnc_symbol == "")
G_list[no.match,2] <- G_list[no.match,1]
ix <- match(G_list$ensembl_gene_id, rownames(spe))
rownames(spe)[ix] <- G_list$hgnc_symbol
interesting_genes <- which(rownames(spe) %in% c("MOBP","PCP4", "SNAP25","NPY"))
### Normalize count matrix and subset to interesting genes
require(scran)
spe <- logNormCounts(spe)

Y.sub <- as.matrix(logcounts(spe)[interesting_genes,])

## Switch rows and columns and make dataframe 
df <- as.data.frame(t(Y.sub))

## Add spatial coordinates 
df$x <- spatialCoords(spe)[,1]; df$y <- spatialCoords(spe)[,2]
### Construct plot
require(reshape2)
df <- melt(df, id.vars=c("x","y"))

p <- ggplot(data=df,aes(x=x,y=y,color=value)) 
p <- p + geom_point(size=0.05) 
p <- p + coord_fixed()
p <- p + scale_color_gradientn(colors=c("gray90","blue","black"),
                              breaks=c(0,3,6))
p <- p + facet_wrap(~variable, nrow=2)
p <- p + theme_bw()
p <- p + guides(color = guide_colorbar(ticks = FALSE)) 
p <- p + labs(color="log count")
p <- p + theme(strip.text = element_text(face = "italic"), 
        panel.grid = element_blank(), 
        axis.title = element_blank(), 
        axis.text = element_blank(), 
        axis.ticks = element_blank())

Finding SVGs automatically

  • A statistical method can automatically identify spatially variable genes without relying on previous knowledge.

  • There are several existing methods, but it is still an area of active research.

SPARK-X

METHOD OVERVIEW

install.packages('devtools') #If not already installed
devtools::install_github('xzhoulab/SPARK')

Running SPARK-X

require(SPARK)

## Remove mitochondrial genes
mito.genes <- which(grepl("^MT-",rownames(spe)))
spe <- spe[-mito.genes,]

res <- sparkx(
    count_in = counts(spe)[rowSums(counts(spe)) > 10,], 
    locus_in = spatialCoords(spe), 
    verbose = FALSE)

Genes with the smallest p-values

require(scran)
interesting_genes <- rownames(res$res_mtest)[order(res$res_mtest$combinedPval)[1:10]]

spe <- logNormCounts(spe)

Y.sub <- as.matrix(logcounts(spe)[interesting_genes,])

## Switch rows and columns and make dataframe 
df <- as.data.frame(t(Y.sub))

## Add spatial coordinates 
df$x <- spatialCoords(spe)[,1]; df$y <- spatialCoords(spe)[,2]

### Construct plot
require(reshape2)
df <- melt(df, id.vars=c("x","y"))

p <- ggplot(data=df,aes(x=x,y=y,color=value)) 
p <- p + geom_point(size=0.05) 
p <- p + coord_fixed()
p <- p + scale_color_gradientn(colors=c("gray90","blue","black"),
                              breaks=c(0,4,8))
p <- p + facet_wrap(~variable, nrow=2)
p <- p + theme_bw()
p <- p + guides(color = guide_colorbar(ticks = FALSE)) 
p <- p + labs(color="log count")
p <- p + theme(strip.text = element_text(face = "italic"), 
        panel.grid = element_blank(), 
        axis.title = element_blank(), 
        axis.text = element_blank(), 
        axis.ticks = element_blank())
p

Plotting genes with the largest p-values (least spatially variable)

Dimension Reduction and Clustering

Dimension Reduction

Our dataset has over \(30,000\) genes. However, many appear to be highly correlated (see the top SVG plot we made earlier).

If we account for the strong correlation, we may be able to summarize the data using a much smaller number of genes, thus reducing the “dimension”.

Principal Component Analysis (PCA)

Informally, PCA begins by finding the line that best approximates the data (PC dimension 1). Next, it finds a line perpendicular to PC dimension 1 that best approximates the data (PC dimension 2). Then repeat to find PC dimension 3, 4, etc.

Image source

Principal Component Analysis (PCA)

To reduce the data to dimension \(k\), we can project each data point onto the linear space spanned by the first \(k\) PC dimensions.

For example, when \(k = 1\) we project each point onto a line. When \(k = 2\), we project each point onto a plane.

Running PCA

For computational speed, it is standard practice to subset to the ~2000 most highly variable genes before running PCA:

top.hvgs <- getTopHVGs(spe, n=2000) #Get the top 2000 genes
spe <- fixedPCA(spe, subset.row=top.hvgs) #Running PCA 

Finding the genes in PC dimension 1

What do you notice about the genes that contribute to PC dimension 1?

pc_dims <- attr(reducedDim(spe), "rotation")

pc_dim1 <- pc_dims[,1]

#Top 5 genes with positive weight
names(pc_dim1)[order(pc_dim1, decreasing=TRUE)[1:5]]
[1] "NEFL"   "SNAP25" "NEFM"   "SYT1"   "VSNL1" 
#Top 5 genes with negative weight
names(pc_dim1)[order(pc_dim1)[1:5]]
[1] "MOBP"  "MBP"   "PLP1"  "GFAP"  "BCAS1"

Projecting onto the first 3 PC dimensions

df <- data.frame(x=spatialCoords(spe)[,1],
                 y=spatialCoords(spe)[,2],
                 PC1=reducedDim(spe)[,1],
                 PC2=reducedDim(spe)[,2],
                 PC3=reducedDim(spe)[,3])

df <- df |> pivot_longer(cols=c(PC1,PC2,PC3))
p <- ggplot(data=df,aes(x=x,y=y,color=value)) + 
  geom_point(size=0.5) + 
  coord_fixed()+
  scale_color_gradient2(low="blue",mid="grey", high="red") +
  theme_bw()+
  facet_wrap(~name,nrow=1)
p

Clustering

The PCA results show that there are groups of spots that have very similar expression levels. By applying a clustering algorithm we can explicitly define these groups of cells. Many approaches for clustering first build the shared nearest neighbor (SNN) graph of the cells.

INSERT FIG

Building the SNN graph and applying a community detection algorithm

require(igraph)
g <- buildSNNGraph(spe, use.dimred="PCA")
cluster <- igraph::cluster_walktrap(g)$membership

Plotting the clusters

df <- data.frame(x=spatialCoords(spe)[,1],
                 y=spatialCoords(spe)[,2],
                 cluster=as.character(cluster))
p <- ggplot(data=df,aes(x=x,y=y,color=cluster)) + 
  geom_point() + 
  theme_bw()
p

Finding marker genes

A marker gene is one that is differentially expressed in a particular cluster. Marker genes are used to better understand the biological function of the identified clusters.

colLabels(spe) <- factor(cluster)
markers <- scoreMarkers(spe)

Printing the top 5 markers in each cluster

for(i in 1:length(markers)) {
  print(paste("Cluster", i))
  print(rownames(markers[[i]])[order(markers[[i]]$mean.AUC, decreasing=TRUE)[1:5]])
}
[1] "Cluster 1"
[1] "SNAP25" "SYT1"   "NEFL"   "VSNL1"  "NRGN"  
[1] "Cluster 2"
[1] "MBP"   "MOBP"  "PLP1"  "BCAS1" "SHTN1"
[1] "Cluster 3"
[1] "MBP"   "PLP1"  "MOBP"  "CNP"   "MTURN"
[1] "Cluster 4"
[1] "SLC1A2" "MAP1A"  "OLFM1"  "ENC1"   "NEFL"  
[1] "Cluster 5"
[1] "ENSG00000269028" "ENSG00000255823" "CXCL14"          "GLUL"           
[5] "CST3"           
[1] "Cluster 6"
[1] "MAP1A"   "MAP1B"   "SNAP25"  "SYT1"    "SLC17A7"
[1] "Cluster 7"
[1] "SPARC"   "AQP4"    "FOS"     "SPARCL1" "CLU"    

Top marker genes

Per REF,

  • MOBP is a white matter/oligodendrocyte marker (Cluster 4)

  • SNAP25 is a gray matter/neuron marker (Cluster 2)

References

Ji, N., & Van Oudenaarden, A. (2012). Single molecule fluorescent in situ hybridization (smFISH) of C. elegans worms and embryos.